Abstract

We introduce a novel algorithm, TOQR, for relaxing failed queries
over databases; i.e., over-constrained DNF queries that return an
empty result. TOQR uses a small dataset to discover the implicit
relationships among the domain attributes, and then it exploits this
domain knowledge to relax the failed query. TOQR starts with a
relaxed query that does not include any constraint, and it tries to
add to it as many as possible of the original constraints or their
relaxations. The order in which the constraints are added is derived
from the domain’s causal structure, which is learned by applying the
TAN algorithm to the small training dataset. Our experiments show
that TOQR clearly outperforms other approaches: even when trained on
a handful of examples, it successfully relaxes more than 97% of the
failed queries; furthermore, TOQR’s relaxed queries are highly
similar to the original failed query.

Introduction 

Manually relaxing failed queries, which do not match any tuple in a
database, is a frustrating, tedious, time-consuming process.
Automated query relaxation algorithms (Gaasterland 1997; Chu et al.
1996b) are typically trained offline to acquire domain knowledge that
is then used to relax all failed queries. In contrast, LOQR (Muslea
2004) takes an online, query-guided approach: the domain knowledge is
extracted online in a process driven by the actual constraints in
each failed query. Even though this approach was shown to be
extremely successful, when trained on small datasets LOQR tends to
generate short queries that contain only a small fraction of the
constraints from the failed query.

We introduce a novel algorithm, TOQR, that is similar to LOQR but
does not share its weakness: even when trained on just a handful of
examples, TOQR generates non-failing relaxed queries that are highly
similar to the failed ones. In order to better explain our
contribution, let us first summarize the similarities between TOQR
and LOQR. They both use a small dataset V to generate - via machine
learning - queries Qi that are then used to relax the failed query.
These queries Qi are created so that (1) they are as similar as
possible to the failed query and (2) they do not fail on V (if V is
representative of the target database TD, it follows that Qi is
unlikely to fail on TD).

Our main contribution is a novel approach to generate non-failing
queries Qi that are highly similar to the failed query Qf. In
contrast to LOQR, which uses V to learn decision rules that are then
converted into queries, TOQR starts with an empty query Qi, to which
it tries to add as many as possible of Qf’s constraints (or their
relaxations). The order in which TOQR considers the constraints is
derived from the domain’s causal structure, which is learned from V.

More precisely, for each constraint in Qf, TOQR uses TAN (Friedman,
Goldszmidt, & Lee 1998) to learn both the topology and the parameters
of a Bayesian network that predicts whether that constraint is
satisfied. TOQR tries to add the Qf constraints to Qi in the order of
the breadth-first traversal of the learned topology. Intuitively, the
breadth-first traversal minimizes the conflicts between the
constraints on the various domain attributes (in the TAN-generated
Bayesian network, the nodes are independent of each other, given
their parents). Our empirical evaluation shows that the order in
which the attributes are considered is critical: adding the
constraints to Qi in an arbitrary order leads to significantly poorer
performance.

Related  Work 

CO-OP (Kaplan 1982) was the first system to address the problem of
failing queries. CO-OP transforms the failed query into an
intermediate, graph-oriented language in which the connected
sub-graphs represent the query’s presuppositions. CO-OP tests each of
these presuppositions against the database by converting the
sub-graphs into sub-queries. FLEX (Motro 1990), which is a
generalization of CO-OP, is highly tolerant to incorrect queries
because of its ability to iteratively interpret the query at lower
levels of correctness. When possible, FLEX proposes non-failing
queries that are similar to the failing ones; otherwise it just
provides an explanation for the query’s failure.

As finding all minimal failing and maximal succeeding sub-queries is
NP-hard (Godfrey 1997), CO-OP and FLEX have a high computational
cost, which comes from evaluating a large number of queries against
the entire database. To speed up the process, (Motro 1986) introduces
heuristics for constraining the search, while (Gaasterland 1997)
controls the query relaxation process via heuristics based on
semantic query-optimization.

CoBase (Chu et al. 1996a; 1996b; Chu, Chen, & Huang 1994), which uses
machine learning techniques to relax the failed queries, is the
closest approach to TOQR and LOQR. By clustering all the tuples in
the target database (Merzbacher & Chu 1993), CoBase automatically
generates Type Abstraction Hierarchies (TAHs) that synthesize the
database schema and tuples into a compact form. To relax a failing
query, CoBase uses three types of TAH-based operators:
generalization, specialization, and association (i.e., moving up,
down, or between the hierarchies, respectively). Note that CoBase
performs the clustering only once, on the entire database, and
independently of the actual constraints in the failing queries. In
contrast, TOQR’s learning process is performed online and is driven
by the constraints in each individual query; furthermore, TOQR uses
only a small training set, thus not requiring access to all the
tuples in the target database.

The  Intuition 

Consider  an  illustrative  laptop  domain,  in  which  the  query 

Qf : Price ≤ $2,000 ∧ CPU ≥ 2.5 GHz ∧
     Display ≥ 17" ∧ Weight ≤ 3 lbs ∧ HDD ≥ 60 GB

fails  because  laptops  under  3  lbs  have  displays  smaller  than 
17"  (and,  vice-versa,  laptops  with  displays  over  17"  weigh 
more  than  3  lbs). 

In order to relax Qf, TOQR proceeds in three steps. First, it uses a
small dataset to learn the domain’s causal structure, which is then
exploited to generate queries that are guaranteed not to fail on the
training data. Second, it identifies the generated query QSim that is
most similar to Qf. Finally, it uses the constraints from QSim to
relax Qf.

Step  1:  Extracting  domain  knowledge 

TOQR uses a small dataset V to discover knowledge that can be used
for query relaxation. TOQR considers Qf’s constraints independently
of each other and learns from V “what does it take” to fulfill each
particular constraint. This knowledge is then used to create a
relaxed query that is similar to Qf, but does not fail on V (as
already mentioned, if V is representative of the target database TD,
the relaxed query is also unlikely to fail on TD).

For example, consider the dataset V in Table 1, which consists of
various laptop configurations. In order to learn to predict whether
Price ≤ $2,000 is satisfied, TOQR creates a duplicate D1 of V: for
each example in D1, TOQR replaces the original value of Price by a
binary one that indicates whether or not Price ≤ $2,000 is satisfied;
finally, the binary attribute Price is designated as D1’s class
attribute.
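
The construction of D1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the row layout, the `OPS` table, and the helper name `make_class_dataset` are hypothetical, units are dropped, and the Screen column of Table 1 is stored under the name Display that the text uses.

```python
import operator

# Comparison operators allowed in a constraint of the form Attr Operator NumVal.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# Table 1, one dict per laptop (hypothetical encoding of the dataset V).
V = [
    {"RAM": 1024, "Price": 2299, "CPU": 3.0, "HDD": 50, "Weight": 3.1, "Display": 18},
    {"RAM": 128, "Price": 1999, "CPU": 1.6, "HDD": 80, "Weight": 3.6, "Display": 14},
    {"RAM": 64, "Price": 1999, "CPU": 2.0, "HDD": 20, "Weight": 2.9, "Display": 12},
    {"RAM": 512, "Price": 1898, "CPU": 2.5, "HDD": 60, "Weight": 4.3, "Display": 16},
    {"RAM": 256, "Price": 1998, "CPU": 2.8, "HDD": 60, "Weight": 4.1, "Display": 17},
]


def make_class_dataset(rows, attr, op, value):
    """Copy rows, replacing attr by 'yes'/'no' according to the constraint."""
    test = OPS[op]
    return [{**row, attr: "yes" if test(row[attr], value) else "no"} for row in rows]


D1 = make_class_dataset(V, "Price", "<=", 2000)
print([row["Price"] for row in D1])  # ['no', 'yes', 'yes', 'yes', 'yes']
```

The original rows are left untouched, so the same V can be reused to build one such dataset per constraint.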

In order to discover “what does it take” to satisfy the constraint
Price ≤ $2,000, TOQR uses D1 to train the TAN learner (Friedman,
Goldszmidt, & Lee 1998). From the given data, TAN learns both the
topology and the parameters of a Bayesian network classifier. In
order to keep the computation tractable, TAN considers only
topologies that are similar to the one shown in Figure 1:


RAM    Price   CPU       HDD     Weight    Screen
1024   $2299   3.0 GHz   50 GB   3.1 lbs   18"
128    $1999   1.6 GHz   80 GB   3.6 lbs   14"
64     $1999   2.0 GHz   20 GB   2.9 lbs   12"
512    $1898   2.5 GHz   60 GB   4.3 lbs   16"
256    $1998   2.8 GHz   60 GB   4.1 lbs   17"

Table 1: The dataset V.


- the network’s root (i.e., Price) influences the values of all
  domain attributes (see dotted arrows in Figure 1);

- each non-class node can also have at most one additional parent
  (e.g., Display also depends on Weight).

One can read the network in Figure 1 as follows: besides the
dependencies on the class attribute, the only significant
dependencies discovered by TAN are the influence of Weight on
Display, and of CPU on RAM and HDD. Intuitively, this means that once
we set the class value (e.g., computers under $2,000), the only
interactions among the values of the attributes are the one between
Weight and Display, and the one among CPU, RAM, and HDD. In turn,
this implies that Qf’s failure is due to the incompatible values of
the attributes in one or both of these groups of attributes.

TOQR uses the extracted domain knowledge (i.e., the Bayesian network)
to generate a query Q'Price that is similar to Qf but does not fail
on V. TOQR starts with an empty query Q'Price, to which it tries to
add - one at a time - the constraints from Qf. The order in which
TOQR considers the constraints is derived from the topology of the
Bayesian network. More precisely, TOQR proceeds as follows:

- it detects all the constraints imposed by Qf on the parentless
  nodes in the network;

- it ranks these constraints by the number of tuples in V that
  satisfy both the current Q'Price and the constraint itself;

- it greedily tries to add to Q'Price as many as possible of the
  constraints (one at a time, higher-ranked first). If adding a
  constraint C leads to the failure of Q'Price, then C is removed and
  the process continues with the next highest-ranking constraint;

- it deletes from the Bayesian network the parentless nodes, together
  with the directed edges leaving them;

- it repeats the steps above until all the nodes are visited.

In our running example, the algorithm above is executed as follows.
In the first iteration, TOQR starts with an empty Q'Price and
considers the Price attribute (as the network’s root, this is the
only parentless node). By adding to Q'Price the Price constraint, we
get Q'Price = Price ≤ $2,000, which matches several tuples in V. As
there are no other parentless nodes to consider, the Price node and
all the dotted arcs are deleted from the network. The remaining
forest consists of two trees, rooted in Weight and CPU, respectively.
These two new parentless nodes are the ones considered by TOQR in the
next iteration.

It is easy to see that TOQR ranks the CPU constraint higher than the
Weight one: among the tuples matching Q'Price (i.e., the bottom four
in Table 1), CPU ≥ 2.5 GHz matches two tuples, while Weight ≤ 3 lbs
matches only one. Consequently, TOQR tries first to add the CPU
constraint, and Q'Price becomes Price ≤ $2,000 ∧ CPU ≥ 2.5 GHz.

Figure 1: Bayesian network learned by TAN.

Then TOQR tries to add Weight ≤ 3 lbs to this new Q'Price, but the
resulting query Price ≤ $2,000 ∧ CPU ≥ 2.5 GHz ∧ Weight ≤ 3 lbs does
not match any tuple from Table 1; consequently, the Weight constraint
is removed from Q'Price. As both parentless nodes were considered,
TOQR deletes the nodes Weight and CPU, together with the edges
originating in them (i.e., all remaining arcs).

In the last iteration, TOQR considers the nodes Display and HDD (RAM
is ignored because it does not appear in Qf). Of the two tuples
matched by Q'Price (i.e., the bottom two in Table 1), Display ≥ 17"
matches only one, while HDD ≥ 60 GB matches both. Consequently,
HDD ≥ 60 GB is added to Q'Price, which becomes Price ≤ $2,000 ∧
CPU ≥ 2.5 GHz ∧ HDD ≥ 60 GB (and still matches the same two tuples).
Then TOQR adds Display ≥ 17" to Q'Price; as the resulting query
matches one tuple in V, we get the final result

Q'Price : Price ≤ $2,000 ∧ CPU ≥ 2.5 GHz ∧
          Display ≥ 17" ∧ HDD ≥ 60 GB
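
The walk-through above can be replayed with a minimal Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: the dict-based encoding of queries, the `PARENTS` map (transcribed from the textual description of Figure 1), and the helper names `matches` and `add_by_topology` are hypothetical, and the relaxation of rejected constraints is left out.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# Table 1 (units dropped; the Screen column is stored as Display).
V = [
    {"RAM": 1024, "Price": 2299, "CPU": 3.0, "HDD": 50, "Weight": 3.1, "Display": 18},
    {"RAM": 128, "Price": 1999, "CPU": 1.6, "HDD": 80, "Weight": 3.6, "Display": 14},
    {"RAM": 64, "Price": 1999, "CPU": 2.0, "HDD": 20, "Weight": 2.9, "Display": 12},
    {"RAM": 512, "Price": 1898, "CPU": 2.5, "HDD": 60, "Weight": 4.3, "Display": 16},
    {"RAM": 256, "Price": 1998, "CPU": 2.8, "HDD": 60, "Weight": 4.1, "Display": 17},
]

# The failed query: attribute -> (operator, value).
Qf = {"Price": ("<=", 2000), "CPU": (">=", 2.5), "Display": (">=", 17),
      "Weight": ("<=", 3), "HDD": (">=", 60)}

# Parents of each node in the network described for Figure 1.
PARENTS = {"Price": set(), "Weight": {"Price"}, "CPU": {"Price"},
           "Display": {"Price", "Weight"}, "RAM": {"Price", "CPU"},
           "HDD": {"Price", "CPU"}}


def matches(rows, query):
    """Return the rows that satisfy every constraint in the conjunction."""
    return [row for row in rows
            if all(OPS[op](row[attr], val) for attr, (op, val) in query.items())]


def add_by_topology(rows, parents, failed):
    """Greedily add constraints level by level, parentless nodes first."""
    remaining = {node: set(p) for node, p in parents.items()}
    query = {}
    while remaining:
        roots = [node for node, p in remaining.items() if not p]
        if not roots:
            raise ValueError("the network must be acyclic")
        cands = [node for node in roots if node in failed]
        # Rank by how many rows still match once the constraint is added.
        cands.sort(key=lambda n: len(matches(rows, {**query, n: failed[n]})),
                   reverse=True)
        for node in cands:
            trial = {**query, node: failed[node]}
            if matches(rows, trial):  # keep the constraint only if non-failing
                query = trial
        for node in roots:            # delete the visited nodes ...
            del remaining[node]
        for p in remaining.values():  # ... and the arcs leaving them
            p.difference_update(roots)
    return query


Q_price = add_by_topology(V, PARENTS, Qf)
print(list(Q_price))            # ['Price', 'CPU', 'HDD', 'Display']
print(len(matches(V, Q_price)))  # 1
```

The order of the keys mirrors the order in which the constraints were accepted; Weight is the only constraint of Qf that is rejected.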

TOQR performs the algorithm above once for each constraint in Qf;
i.e., TOQR also creates the datasets D2 - D5, in which the binary
class attributes reflect whether or not the constraints on CPU, HDD,
Weight, and Display are satisfied, respectively. Then TAN is applied
to each of these datasets, and the corresponding queries Q'CPU,
Q'HDD, Q'Weight, and Q'Display are generated.

At this point, let us re-emphasize that the learning process above
takes place online, for each failing query Qf; furthermore, the
process is also query-guided in the sense that each of the datasets
D1 - D5 is created at runtime by using the actual constraints from
the failed query. This online, query-guided nature of both TOQR and
LOQR distinguishes them from all other existing approaches.

The key characteristic of TOQR, which also represents our main
contribution, is the use of the learned network topology for deriving
the order in which the constraints are added to Q'. As our
experiments will show, an algorithm identical to TOQR - except that
it adds the constraints in an arbitrary order - performs
significantly worse than TOQR. Intuitively, this is due to the fact
that - early in the search - one may inadvertently add a constraint
that dramatically constrains the range of values for the other domain
attributes.

For example, assume that TOQR begins by considering first the
constraint on Weight (rather than the one on Price). Then TOQR starts
with Q'Price = Weight ≤ 3 lbs, which matches a single example in V
(i.e., the third one). As the only Qf constraint that can be added to
this query is the one on Price, it follows that the final result is
Price ≤ $2,000 ∧ Weight ≤ 3 lbs. It is easy to see that this
two-constraint query is less similar to Qf than the four-constraint
one created earlier.
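
The effect of the ordering can be checked directly on Table 1. The sketch below is illustrative only: the helper `add_in_order` and the two explicit orderings are assumptions made for this example, with the second ordering being one breadth-first traversal of the network described for Figure 1.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# Table 1 (units dropped; the Screen column is stored as Display).
V = [
    {"RAM": 1024, "Price": 2299, "CPU": 3.0, "HDD": 50, "Weight": 3.1, "Display": 18},
    {"RAM": 128, "Price": 1999, "CPU": 1.6, "HDD": 80, "Weight": 3.6, "Display": 14},
    {"RAM": 64, "Price": 1999, "CPU": 2.0, "HDD": 20, "Weight": 2.9, "Display": 12},
    {"RAM": 512, "Price": 1898, "CPU": 2.5, "HDD": 60, "Weight": 4.3, "Display": 16},
    {"RAM": 256, "Price": 1998, "CPU": 2.8, "HDD": 60, "Weight": 4.1, "Display": 17},
]

Qf = {"Price": ("<=", 2000), "CPU": (">=", 2.5), "Display": (">=", 17),
      "Weight": ("<=", 3), "HDD": (">=", 60)}


def matches(rows, query):
    """Return the rows that satisfy every constraint in the conjunction."""
    return [row for row in rows
            if all(OPS[op](row[attr], val) for attr, (op, val) in query.items())]


def add_in_order(rows, failed, order):
    """Add constraints in the given order, skipping any that empties the result."""
    query = {}
    for attr in order:
        trial = {**query, attr: failed[attr]}
        if matches(rows, trial):
            query = trial
    return query


weight_first = add_in_order(V, Qf, ["Weight", "Price", "CPU", "Display", "HDD"])
topology = add_in_order(V, Qf, ["Price", "CPU", "Weight", "HDD", "Display"])
print(sorted(weight_first))  # ['Price', 'Weight']
print(sorted(topology))      # ['CPU', 'Display', 'HDD', 'Price']
```

Starting with Weight leaves a single matching laptop whose CPU, display, and disk all violate Qf, so only the Price constraint can still be added.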

Steps  2  &  3:  Relaxing  the  failing  query 

In order to complete the query relaxation process, TOQR proceeds in
two steps, which are identical to the ones performed by LOQR (Muslea
2004). First, it finds - among the queries generated in the previous
step - the one that is most similar to Qf. Second, it uses the
constraints from this “most similar” query to relax the constraints
in Qf.

For pedagogical purposes, let us assume that when learning to predict
whether CPU ≥ 2.5 GHz is satisfied, TOQR generates the query

Q'CPU : Price ≤ $3,000 ∧ CPU ≥ 2.5 GHz ∧ Weight ≤ 4 lbs

Note  that  two  of  the  values  in  the  constraints  above  are  not 
identical  to  those  in  Qf;  this  is  an  illustration  of  TOQR’s 
ability  to  relax  the  numerical  values  from  the  constraints  that 
could  not  be  added  unchanged  to  Q'  (see  next  section  for 
details). 

It is easy to see that Q'Price is more similar to Qf than Q'CPU: the
only difference between Q'Price and Qf is that the former does not
impose a constraint on the Weight; in contrast, Q'CPU includes a
weaker constraint on Price, without imposing any constraints on
display or hard disk sizes. More formally, TOQR uses the similarity
metric from (Muslea 2004), in which the importance/relevance of each
attribute is described by user-provided weights.

To illustrate TOQR’s third step, let us assume that, among the
queries generated after applying TAN to D1 - D5, the one that is the
most similar to Qf is

Q'Display : Display ≥ 17" ∧ Price ≤ $2,300 ∧
            Weight ≤ 3.1 lbs ∧ CPU ≥ 2.5 GHz

Then TOQR creates a relaxed query QRlx that contains only constraints
on attributes that appear both in Qf and Q'Display; for each of these
constraints, QRlx uses the least constraining of the numeric values
in Qf and Q'Display. In our example, we get

QRlx : Price ≤ $2,300 ∧ CPU ≥ 2.5 GHz ∧
       Display ≥ 17" ∧ Weight ≤ 3.1 lbs

which is obtained by dropping the original constraint on the hard
disk (since it appears only in Qf), keeping the constraints on CPU
and Display unchanged (Qf and Q'Display have identical constraints on
these attributes), and setting the values for Price and Weight to the
least constraining ones.
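
The refinement step lends itself to a short sketch. The function name `refine` and the dict encoding are hypothetical, and the code assumes that both queries use the same operator on a shared attribute; the formal definition is given in (Muslea 2004).

```python
# A conjunction maps each constrained attribute to (operator, numeric value).
Qf = {"Price": ("<=", 2000), "CPU": (">=", 2.5), "Display": (">=", 17),
      "Weight": ("<=", 3), "HDD": (">=", 60)}
Q_display = {"Display": (">=", 17), "Price": ("<=", 2300),
             "Weight": ("<=", 3.1), "CPU": (">=", 2.5)}


def refine(failed, most_similar):
    """Keep shared attributes only, each with the least constraining value."""
    relaxed = {}
    for attr, (op, value) in failed.items():
        if attr not in most_similar:
            continue  # the constraint appears only in the failed query: drop it
        _, other = most_similar[attr]
        # An upper bound is looser when larger, a lower bound when smaller.
        loosest = max(value, other) if op in ("<", "<=") else min(value, other)
        relaxed[attr] = (op, loosest)
    return relaxed


Q_rlx = refine(Qf, Q_display)
print(Q_rlx)
# {'Price': ('<=', 2300), 'CPU': ('>=', 2.5), 'Display': ('>=', 17), 'Weight': ('<=', 3.1)}
```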

The approach above has two advantages. First, as Q'Display is the
query most similar to Qf, TOQR makes minimal changes to the original
failing query. Second, as the constraints in QRlx are a subset of
those in Q'Display, and they are at most as tight as those in
Q'Display (some of them may use looser values from Qf), it follows
that all examples that satisfy Q'Display also satisfy QRlx. In turn,
this implies that QRlx is guaranteed not to fail on V, which makes it
unlikely to fail on the target database.

The  TOQR  algorithm 

As shown in Figure 2, TOQR takes as input a failed DNF query
Qf = C1 ∨ C2 ∨ ... ∨ Cn and relaxes its disjuncts Ck independently of
each other (for a DNF query to fail, all of its disjuncts must fail).
Each disjunct Ck is a conjunction of constraints imposed on (a subset
of) the domain attributes:

Ck = Constr(Ai1) ∧ Constr(Ai2) ∧ ... ∧ Constr(Aik).

We use the notation ConstrCk(Aj) to denote the constraint imposed by
Ck on the attribute Aj. In this paper, we consider constraints of the
type Attr Operator NumVal, where Attr is a domain attribute, NumVal
is a numeric value, while Operator is one of <, ≤, >, or ≥.
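
As a concrete reading of this notation, the sketch below encodes a DNF query as a list of conjunctions and checks the remark that it fails only when every disjunct fails. The encoding, the helper names, and the two example disjuncts are assumptions made for illustration; the rows are those of Table 1 with units dropped.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# Table 1 (units dropped; the Screen column is stored as Display).
V = [
    {"RAM": 1024, "Price": 2299, "CPU": 3.0, "HDD": 50, "Weight": 3.1, "Display": 18},
    {"RAM": 128, "Price": 1999, "CPU": 1.6, "HDD": 80, "Weight": 3.6, "Display": 14},
    {"RAM": 64, "Price": 1999, "CPU": 2.0, "HDD": 20, "Weight": 2.9, "Display": 12},
    {"RAM": 512, "Price": 1898, "CPU": 2.5, "HDD": 60, "Weight": 4.3, "Display": 16},
    {"RAM": 256, "Price": 1998, "CPU": 2.8, "HDD": 60, "Weight": 4.1, "Display": 17},
]


def satisfies(row, conjunction):
    """True if the row meets every Attr Operator NumVal constraint."""
    return all(OPS[op](row[attr], val) for attr, (op, val) in conjunction.items())


def dnf_fails(rows, disjuncts):
    """A DNF query fails when no row satisfies any of its disjuncts."""
    return not any(satisfies(row, c) for row in rows for c in disjuncts)


# Two hypothetical disjuncts: a light large-screen laptop, or a very cheap one.
C1 = {"Weight": ("<=", 3), "Display": (">=", 17)}
C2 = {"Price": ("<=", 1000)}
print(dnf_fails(V, [C1, C2]))                          # True
print(dnf_fails(V, [C1, {"Price": ("<=", 1900)}]))     # False
```

Relaxing a single disjunct is enough to make the whole query succeed, which is why each failing conjunction can be treated on its own.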

As  we  have  already  mentioned,  TOQR’s  second  and  third 
steps  (see  Figure  2)  are  identical  to  the  ones  in  LOQR.  As  the 
intuition  behind  them  was  presented  in  the  previous  section, 
for  a  formal  description  of  these  steps  we  refer  the  reader  to 
(Muslea  2004).  In  the  remainder  of  this  paper  we  focus  on 
TOQR’s  first  step,  which  represents  our  main  contribution. 

Step  1:  Extracting  the  domain  knowledge 

TOQR uses a dataset V to discover the implicit relationships that
hold among the domain attributes. This is done by learning to
predict, for each attribute Aj in Ck, “what does it take” for
ConstrCk(Aj) to be satisfied; then the learned information is used to
generate a query that does not fail on V and contains ConstrCk(Aj),
together with as many as possible of the other constraints in Ck.

As shown in Figure 3 (see ExtractDomainKnowledge()), for each
attribute Aj in Ck, TOQR proceeds as follows:

1. it creates a copy Dj of V; in each example in Dj, Aj is set to yes
   or no, depending on whether or not ConstrCk(Aj) is satisfied. This
   binary attribute Aj is then designated as Dj’s class attribute.

2. it applies TAN to Dj, thus learning the domain’s causal structure,
   which is expressed as a restricted Bayesian network (each
   non-class node has as parents the class attribute and at most one
   other node).

3. it uses the learned Bayesian network to generate a query (see
   “BN2Query()”) that

   - does not fail on V, which also makes it highly unlikely to fail
     on the target database;

   - is as similar as possible to the original disjunct Ck.

Given:
  - a failed DNF query Qf = C1 ∨ C2 ∨ ... ∨ Cn
  - a small dataset V representative of the target database

RelaxedQuery = ∅
FOR EACH of Qf’s failing conjunctions Ck DO
  - Step 1: Queries = ExtractDomainKnowledge(Ck, V)
  - Step 2: Refiner = FindMostSimilar(Ck, Queries)
  - Step 3: RelaxedConjunction = Refine(Ck, Refiner)
  - RelaxedQuery = RelaxedQuery ∨ RelaxedConjunction

Figure 2: TOQR independently relaxes each conjunction.

“BN2Query()” starts with an empty candidate query Q', to which it
tries to add as many as possible of the constraints in Ck or their
relaxations. As shown in Figure 3, “BN2Query()” is a 3-step iterative
process. First, it detects all the parentless nodes in the network
(in the first iteration, it will be only the class node). Second, it
sorts these nodes according to the effect that they have on the
coverage of Q' (i.e., how many examples in V would satisfy Q' if Ck’s
constraint on that attribute is added to Q'). Third, it greedily adds
to Q' the constraints on the parentless nodes, starting with the ones
that lead to higher coverage. If adding ConstrCk(A) to Q' leads to
the failure of the new query, A is added to the RetryAttribs list; in
a second pass, “BN2Query()” tries to relax the constraints on A by
changing its numeric value by one, two, or three standard deviations
(these statistics are computed from V). Finally, the parentless nodes
and their out-going arcs are eliminated from the network, and the
entire process is repeated until all the nodes are visited.
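
The second pass can be sketched as follows. Several details are assumptions made for this illustration, since the text does not fix them: the helper name `relax_constraint`, the use of the sample standard deviation, trying the three step sizes in increasing order, and loosening upper bounds upward and lower bounds downward.

```python
import operator
import statistics

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# Table 1 (units dropped; the Screen column is stored as Display).
V = [
    {"RAM": 1024, "Price": 2299, "CPU": 3.0, "HDD": 50, "Weight": 3.1, "Display": 18},
    {"RAM": 128, "Price": 1999, "CPU": 1.6, "HDD": 80, "Weight": 3.6, "Display": 14},
    {"RAM": 64, "Price": 1999, "CPU": 2.0, "HDD": 20, "Weight": 2.9, "Display": 12},
    {"RAM": 512, "Price": 1898, "CPU": 2.5, "HDD": 60, "Weight": 4.3, "Display": 16},
    {"RAM": 256, "Price": 1998, "CPU": 2.8, "HDD": 60, "Weight": 4.1, "Display": 17},
]


def matches(rows, query):
    """Return the rows that satisfy every constraint in the conjunction."""
    return [row for row in rows
            if all(OPS[op](row[attr], val) for attr, (op, val) in query.items())]


def relax_constraint(rows, query, attr, op, value):
    """Loosen (op, value) by 1, 2, then 3 standard deviations of attr in rows."""
    sd = statistics.stdev([row[attr] for row in rows])  # sample std (assumption)
    direction = 1 if op in ("<", "<=") else -1          # loosen, never tighten
    for k in (1, 2, 3):
        trial = {**query, attr: (op, value + direction * k * sd)}
        if matches(rows, trial):
            return trial  # first relaxation that no longer fails on rows
    return query          # no relaxation worked: leave the query unchanged


# Weight <= 3 could not be added to this query in the running example.
Q = {"Price": ("<=", 2000), "CPU": (">=", 2.5)}
relaxed = relax_constraint(V, Q, "Weight", "<=", 3)
print(relaxed["Weight"])
```

On Table 1, one standard deviation is not enough (the two candidate laptops weigh 4.1 and 4.3 lbs), so under these assumptions the sketch settles on the two-deviation bound, which admits the 4.1 lbs laptop only.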

Experimental  results 

We empirically compare TOQR with LOQR and two baselines, AddOne and
DropOne. AddOne is identical to TOQR, except for adding the
constraints in an arbitrary order (rather than by exploiting the
learned Bayesian structure). DropOne starts with the original query
and arbitrarily removes one constraint at a time until the resulting
query does not fail.
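
A minimal sketch of the DropOne baseline, as described above, is given below. The function name `drop_one`, the use of a seeded random generator to model the arbitrary choice, and the dict encoding of queries are assumptions; the data is Table 1 from the running example.

```python
import operator
import random

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# Table 1 (units dropped; the Screen column is stored as Display).
V = [
    {"RAM": 1024, "Price": 2299, "CPU": 3.0, "HDD": 50, "Weight": 3.1, "Display": 18},
    {"RAM": 128, "Price": 1999, "CPU": 1.6, "HDD": 80, "Weight": 3.6, "Display": 14},
    {"RAM": 64, "Price": 1999, "CPU": 2.0, "HDD": 20, "Weight": 2.9, "Display": 12},
    {"RAM": 512, "Price": 1898, "CPU": 2.5, "HDD": 60, "Weight": 4.3, "Display": 16},
    {"RAM": 256, "Price": 1998, "CPU": 2.8, "HDD": 60, "Weight": 4.1, "Display": 17},
]

Qf = {"Price": ("<=", 2000), "CPU": (">=", 2.5), "Display": (">=", 17),
      "Weight": ("<=", 3), "HDD": (">=", 60)}


def matches(rows, query):
    """Return the rows that satisfy every constraint in the conjunction."""
    return [row for row in rows
            if all(OPS[op](row[attr], val) for attr, (op, val) in query.items())]


def drop_one(rows, failed, rng):
    """Remove arbitrarily chosen constraints until the query stops failing."""
    query = dict(failed)
    while query and not matches(rows, query):
        del query[rng.choice(sorted(query))]
    return query


relaxed = drop_one(V, Qf, random.Random(0))
print(sorted(relaxed), len(matches(V, relaxed)))
```

Because the dropped constraint is chosen without regard to the cause of the failure, the result may discard constraints that were compatible with the data, which is why DropOne serves only as a strawman.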

The  Datasets  and  the  Setup 

We follow the experimental setup that was proposed for LOQR’s
evaluation (Muslea 2004), with two exceptions. First, as in many
real-world applications one can rarely get a dataset V that consists
of more than a few dozen examples, we consider only datasets V of at
most 100 examples (also remember that LOQR performs inadequately on
such small datasets). Second, we use only five of the six datasets
used for LOQR’s evaluation: Laptops, Breast Cancer (Wisconsin), Pima,
Water, and Waveform. This is because the sixth dataset, LRS, has a
large number of attributes (99 versus 5, 10, 8, 21, 38,
respectively), which leads to slow running times (remember that for
each query relaxation, LOQR and TOQR invoke their respective
learners - C4.5 and TAN - once for each domain attribute).

For each of the five domains, we use the seven failing queries
proposed in (Muslea 2004). We consider datasets V of sizes 10, 20,
..., 100 examples; for each of these sizes, we create 20 arbitrary
instances of V. Each algorithm uses V to create a relaxed query Qr,
which is then evaluated on a test set that consists of all examples
from the target database that are not in V. For each size of V and
each of the seven failing queries, each algorithm is run 20 times
(once for each instance of V); consequently, the reported results are
the average of these 140 runs.

ExtractDomainKnowledge(conjunction Ck, dataset V)
  - Queries = ∅
  FOR EACH attribute Aj that appears in Ck DO
    - create a binary classification dataset Dj as follows:
      - FOR EACH example ex ∈ V DO
        - make a copy ex' of ex
        - IF ex'.Aj satisfies ConstrCk(Aj)
            THEN set ex'.Aj to “yes”
            ELSE set ex'.Aj to “no”
        - add ex' to Dj
      - designate Aj as the (binary) class attribute of Dj
    - apply TAN to Dj, with BNj being the learned Bayesian network
    - Queries = Queries ∪ BN2Query(V, BNj, Ck)
  - return Queries

BN2Query(dataset V, TAN network BN, conjunction Ck)
  - Q' = ∅
  WHILE there are unvisited nodes in BN DO
    - let Nodes be the set of parentless vertices in BN
    - let Cands = {Qi | ∀Ai ∈ Nodes, Qi = Q' ∧ ConstrCk(Ai)}
    - let Match(Qi) = {x ∈ V | Satisfies(x, Qi)}
    - sort Cands in the decreasing order of |Match(Qi)|, Qi ∈ Cands
    - let RetryAttribs = ∅
    FOR EACH Qi in the sorted Cands DO
      IF Q' ∧ ConstrCk(Ai) does not fail on V THEN
        - Q' = Q' ∧ ConstrCk(Ai)
      ELSE RetryAttribs = RetryAttribs ∪ {Ai}
    FOR EACH A ∈ RetryAttribs DO
      - let RlxConstr = RelaxConstraint(A)
      IF Q' ∧ RlxConstr does not fail on V THEN
        - Q' = Q' ∧ RlxConstr
    - remove from BN all Nodes and the arcs leaving them
  - return Q'

Figure 3: Extracting domain knowledge by using the learned structure
of the Bayesian network to generate non-failing queries.

Figure 4: Similarity: how similar is the relaxed query to the failed one?

The  Results 

In  our  experiments,  we  focus  on  two  performance  measures: 

- robustness: what percentage of the failing queries are successfully
  relaxed (i.e., they don’t fail anymore)?

- similarity: how similar to Qf is the relaxed query? We define the
  similarity between two conjunctions C and C' as the average - over
  all domain attributes - of the attribute-wise similarity

  \[
  \mathit{Sim}_{A_j}(C, C') = 1 - \frac{\left|\mathit{Value}_{C}(A_j) - \mathit{Value}_{C'}(A_j)\right|}{\mathit{maxValue}_{V}(A_j) - \mathit{minValue}_{V}(A_j)}
  \]

  (by definition, if an attribute appears in only one of the
  conjunctions, SimAj(C, C') = 0).
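
Both measures are easy to compute once queries are encoded as data. The sketch below is illustrative: the function names, the explicit `attrs` argument, and the treatment of attributes constrained in neither conjunction (they contribute 0 here) are assumptions, as the definition above only covers the case of an attribute present in exactly one conjunction. The value range is taken from Table 1 and is assumed to be non-zero.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# Table 1 (units dropped; the Screen column is stored as Display).
V = [
    {"RAM": 1024, "Price": 2299, "CPU": 3.0, "HDD": 50, "Weight": 3.1, "Display": 18},
    {"RAM": 128, "Price": 1999, "CPU": 1.6, "HDD": 80, "Weight": 3.6, "Display": 14},
    {"RAM": 64, "Price": 1999, "CPU": 2.0, "HDD": 20, "Weight": 2.9, "Display": 12},
    {"RAM": 512, "Price": 1898, "CPU": 2.5, "HDD": 60, "Weight": 4.3, "Display": 16},
    {"RAM": 256, "Price": 1998, "CPU": 2.8, "HDD": 60, "Weight": 4.1, "Display": 17},
]

Qf = {"Price": ("<=", 2000), "CPU": (">=", 2.5), "Display": (">=", 17),
      "Weight": ("<=", 3), "HDD": (">=", 60)}
Q_price = {"Price": ("<=", 2000), "CPU": (">=", 2.5),
           "Display": (">=", 17), "HDD": (">=", 60)}


def matches(rows, query):
    """Return the rows that satisfy every constraint in the conjunction."""
    return [row for row in rows
            if all(OPS[op](row[attr], val) for attr, (op, val) in query.items())]


def similarity(c1, c2, rows, attrs):
    """Average attribute-wise similarity of two conjunctions over attrs."""
    total = 0.0
    for attr in attrs:
        if attr in c1 and attr in c2:
            values = [row[attr] for row in rows]
            span = max(values) - min(values)  # assumed to be non-zero
            total += 1 - abs(c1[attr][1] - c2[attr][1]) / span
        # an attribute missing from either conjunction contributes 0
    return total / len(attrs)


def robustness(relaxed_queries, test_rows):
    """Percentage of relaxed queries that match at least one test tuple."""
    ok = sum(1 for q in relaxed_queries if matches(test_rows, q))
    return 100.0 * ok / len(relaxed_queries)


print(similarity(Qf, Q_price, V, list(Qf)))  # 0.8
```

In this example four of the five attributes of Qf keep their original values and the Weight constraint is missing, hence the average of 0.8.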

Figures 4 and 5 show the similarity and robustness results on the
five domains. TOQR obtains by far the best similarity results: on
four of the five domains its similarity levels are dramatically
higher than those of the other algorithms; the only exception is
Breast Cancer, where AddOne performs slightly better. TOQR is also
extremely robust: on four of the five domains, it succeeds on more
than 99% of the 140 relaxation tasks (i.e., 20 distinct training sets
for each of the seven failed queries); on the fifth domain, Water,
TOQR still reaches a robustness of 97%.

Overall, TOQR emerges as a clear winner. DropOne, which is just a
strawman, performs poorly on all domains. The other two algorithms
score well either in robustness or in similarity, but at the price of
a poor score on the other measure. For example, in terms of
robustness, LOQR is competitive with TOQR on most domains; however,
on four of the five domains, LOQR’s queries are only half as similar
to Qf as the TOQR-generated queries.

Figure 5: Robustness: what percentage of the relaxed queries are not failing?

Finally, let us emphasize an unexpected result: when trained on
datasets V of at most 30 examples, TOQR typically reaches a
robustness of 99-100%; however, as the size of V increases, the
robustness tends to decrease by 1-3%. This is due to the fact that -
in larger Vs - there may be a few “outliers” that mislead TOQR. We
analyzed TOQR’s traces on these few unsuccessful relaxations, and we
noticed that such atypical examples (i.e., no similar examples exist
in the test set) may lead to TOQR greedily adding to the relaxed
query Q' a constraint that causes its failure on the test set. As a
few straightforward strategies to cope with the problem failed, this
remains a topic for future work.

Conclusions 

We have introduced TOQR, which is an online, query-driven approach to
query relaxation. TOQR uses a small dataset to learn the domain’s
causal structure, which is then used to relax the failing query. We
have shown that, even when trained on a handful of examples, TOQR
successfully relaxes more than 97% of the failing queries;
furthermore, it also generates relaxed queries that are highly
similar to the original, failing query. In the future, we plan to
create a mixed-initiative system that allows the user to explore the
space of possible query relaxations. This is motivated by the fact
that a user’s preferences are rarely cast in iron: even though
initially the user may be unwilling to relax (some of) the original
constraints, she may often change her mind while browsing several
(imperfect) solutions.